data_tibble <- read_delim(file = paste0(working_directory_path, "/data/final_data.csv"),
show_col_types = FALSE)
data_tibble <- data_tibble %>%
filter(Classification_2020 != ".")
data_tibble <- data_tibble %>%
mutate(Classification_2020 = factor(Classification_2020,
levels = c("L", "LM", "UM", "H"),
labels = c("Low", "Lower-middle", "Upper-middle", "High"),
ordered = TRUE)
)
data_tibble <- data_tibble %>%
mutate(
Continent = factor(Continent),
Region = factor(Region)
)
data_tibble <- data_tibble %>%
mutate(
Net_Migration_Rate = na_if(Net_Migration_Rate, ".") %>% as.double(),
Median_Age = na_if(Median_Age, ".") %>% as.double(),
Youth_Unemployment_Rate = na_if(Youth_Unemployment_Rate, ".") %>% as.double()
)
head(data_tibble)
## # A tibble: 6 × 8
## Net_Migration_Rate Median_Age Youth_Unemployment_R…¹ ISO Classification_2020
## <dbl> <dbl> <dbl> <chr> <ord>
## 1 27.1 23.5 35.8 SYR Low
## 2 15.5 37.2 NA VGB High
## 3 13.3 39.5 14.2 LUX High
## 4 13 40.5 13.8 CYM High
## 5 11.8 35.6 9.1 SGP High
## 6 10.6 32.9 5.3 BHR High
## # ℹ abbreviated name: ¹Youth_Unemployment_Rate
## # ℹ 3 more variables: Country <chr>, Region <fct>, Continent <fct>
Using ggplot2, create a density plot of the median age grouped by income status groups. The densities for the different groups are superimposed in the same plot rather than in different plots. Ensure that you order the levels of the income status such that in the plots the legend is ordered from High (H) to Low (L).
Comment briefly on the plot.
Answer
# Filter out non-finite values in the Median_Age column
filtered_data <- data_tibble %>%
filter(is.finite(Median_Age))
# Create the density plot; reverse the legend so that it runs from High to Low
ggplot(filtered_data, aes(x = Median_Age, fill = Classification_2020)) +
  geom_density(alpha = 0.5, color = "black") +
  scale_fill_manual(values = colorblind_palette) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(x = "Median age of population") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 10),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))
The density plot shows that low-income countries have predominantly young populations, with the density peaking at a median age of around 20-25 years. The distributions shift to the right as income status increases, and high-income countries have the oldest populations, peaking at around 40-45 years.
Investigate how the income status is distributed in the different continents.
Answer
# Create stacked barplot of absolute frequencies
ggplot(data_tibble, aes(x = Continent, fill = Classification_2020)) +
geom_bar(position = "stack") +
scale_fill_manual(values = colorblind_palette) +
labs(x = "Continent", y = "Count", fill = "Income Status") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12))
The absolute frequencies plot shows that Africa predominantly consists of low and lower-middle-income countries, reflecting its economic challenges, while Europe and North America are mostly high-income, indicating advanced economic development. Asia displays a diverse economic landscape with significant representation across all income categories. Oceania and South America show a mix of income levels, highlighting varying development within these regions.
# Create stacked barplot of relative frequencies
ggplot(data_tibble, aes(x = Continent, fill = Classification_2020)) +
geom_bar(position = "fill") +
scale_fill_manual(values = colorblind_palette) +
labs(x = "Continent", y = "Proportion", fill = "Income Status") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12))
The relative frequencies plot makes it easier to compare the proportion of income statuses within each continent but obscures the actual number of countries. In contrast, the absolute frequencies plot clearly shows the total count of countries in each income status category, making it easier to see the magnitude but harder to compare proportions within continents. Thus, the relative plot is better for internal distribution comparison, while the absolute plot is better for understanding total counts.
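The numbers behind the two bar charts can also be tabulated directly. The following is a minimal sketch on a made-up toy data frame (the object toy and all its values are hypothetical, not the assignment data): table() gives the counts that position = "stack" displays, and prop.table() with margin = 1 gives the within-continent shares that position = "fill" displays.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Continent = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  Income    = factor(c("Low", "Low", "High", "High", "High"),
                     levels = c("Low", "High"))
)

# Absolute frequencies: what position = "stack" displays
counts <- table(toy$Continent, toy$Income)

# Relative frequencies within each continent: what position = "fill" displays
shares <- prop.table(counts, margin = 1)

print(counts)
print(shares)
```

Each row of shares sums to one, which is why the filled bars all have the same height regardless of how many countries a continent contains.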
# Create mosaic plot
mosaicplot(~ Continent + Classification_2020,
data = data_tibble,
color = colorblind_palette,
main = "Mosaic Plot of Continents and Income Status",
xlab = "Continent",
ylab = "Income Status")
The mosaic plot combines the strengths of the relative and absolute frequency plots. The width of each column is proportional to the number of countries in the continent, and the height of each tile within a column is proportional to the share of the income status within that continent. The area of a tile is therefore proportional to the absolute count of countries in that combination. This allows both the distribution within continents and the size of each category to be compared in a single plot.
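The geometry of a mosaic plot can be checked numerically. The sketch below uses hypothetical counts (not the assignment data) and ignores the small gaps that mosaicplot() draws between tiles: the column width is the continent's share of all countries, the tile height is the income share within the continent, and their product equals the joint proportion.

```r
# Hypothetical counts: rows = continents, columns = income groups
counts <- matrix(c(6, 2,
                   1, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Continent = c("A", "B"),
                                 Income    = c("Low", "High")))

widths  <- rowSums(counts) / sum(counts)   # column width per continent
heights <- prop.table(counts, margin = 1)  # tile height within a continent
areas   <- sweep(heights, 1, widths, "*")  # width * height for every tile

# The tile areas coincide with the joint proportions
print(areas)
print(counts / sum(counts))
```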
For Asia, investigate further how the income status distribution is in the different subcontinents. Use one of the plots in b. for this purpose. Comment on the results.
Answer
# Filter data to include only Asia
asia_data <- data_tibble %>%
filter(Continent == "Asia")
# Create stacked barplot of absolute frequencies for Asian subcontinents
ggplot(asia_data, aes(x = Region, fill = Classification_2020)) +
geom_bar(position = "stack") +
scale_fill_manual(values = colorblind_palette) +
labs(x = "Subcontinent", y = "Count", fill = "Income Status") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12))
The absolute stacked barplot reveals that Western Asia has the highest diversity in income statuses, with significant representation in all categories, including a notable portion in the high-income category. Eastern Asia shows a strong presence of high-income and upper-middle-income countries, reflecting its economic development. Central Asia, South-eastern Asia, and Southern Asia are predominantly lower-middle-income regions, indicating more uniform economic status within these subcontinents.
Answer
# Filter out non-finite values
filtered_data <- data_tibble %>%
filter(is.finite(Net_Migration_Rate))
# Create parallel boxplots
ggplot(filtered_data, aes(x = Continent, y = Net_Migration_Rate)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
labs(x = "Continent", y = "Net Migration Rate") +
ggtitle("Distribution of Net Migration Rate by Continent") +
theme_minimal() +
theme(axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
# Create parallel boxplots again, but zoom in on the y-axis so that
# the extreme negative outlier (Lebanon) falls outside the visible range
ggplot(filtered_data, aes(x = Continent, y = Net_Migration_Rate)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
labs(x = "Continent", y = "Net Migration Rate") +
ggtitle("Distribution of Net Migration Rate by Continent\n(y-axis limited to [-30, 30], Lebanon not shown)") +
coord_cartesian(ylim = c(-30, 30)) +
theme_minimal() +
theme(axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
To make the plot easier to read, the second version restricts the y-axis to the range from -30 to 30 with coord_cartesian(). Lebanon's extremely low net migration rate (-88.7) otherwise stretches the scale and compresses the boxes of all continents. Lebanon is only hidden from view; it still enters the boxplot statistics for Asia.
The zoomed boxplot shows the net migration rate distributions across continents, with the most extreme negative outlier outside the visible range. Asia and Africa have the most variation and many outliers, indicating diverse migration dynamics, which suggests that countries on these continents face very different economic, political, and social conditions. Europe, North America, and Oceania have more concentrated distributions with fewer outliers, which may reflect more stable migration trends. South America shows a narrower range and fewer extreme values, indicating more uniform migration trends within the continent.
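The difference between zooming and dropping data matters for the boxplot statistics. The following is a minimal sketch on hypothetical values for a single group (the data frame toy is made up): coord_cartesian() only changes the visible range, whereas limits set on the y scale discard out-of-range values before the statistics are computed. layer_data() from ggplot2 is used to read the computed boxplot statistics.

```r
library(ggplot2)

# Hypothetical values for a single group, with one extreme negative outlier
toy <- data.frame(
  group = "A",
  rate  = c(-88.7, -5, -2, 0, 1, 3, 27.1)
)

p <- ggplot(toy, aes(x = group, y = rate)) + geom_boxplot()

# Zoom only: all seven values still enter the boxplot statistics
zoomed <- layer_data(p + coord_cartesian(ylim = c(-30, 30)))

# Scale limits: values outside [-30, 30] are dropped before the statistics
# are computed (ggplot2 warns about the removed row, suppressed here)
clipped <- suppressWarnings(
  layer_data(p + scale_y_continuous(limits = c(-30, 30)))
)

zoomed$middle   # median based on all seven values
clipped$middle  # median based on the six values inside the limits
```

If the outlier should really be excluded from the statistics, filtering the row before plotting states that intention more explicitly than setting scale limits.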
# Filter data for Asia and remove non-finite values
asia_data <- data_tibble %>%
filter(Continent == "Asia" & is.finite(Net_Migration_Rate))
# Identify the largest negative and positive outliers in Asia
largest_negative_outlier <- asia_data %>%
filter(Net_Migration_Rate == min(Net_Migration_Rate, na.rm = TRUE)) %>%
select(Country, Net_Migration_Rate)
largest_positive_outlier <- asia_data %>%
filter(Net_Migration_Rate == max(Net_Migration_Rate, na.rm = TRUE)) %>%
select(Country, Net_Migration_Rate)
# Display the results
print(largest_negative_outlier)
## # A tibble: 1 × 2
## Country Net_Migration_Rate
## <chr> <dbl>
## 1 Lebanon -88.7
print(largest_positive_outlier)
## # A tibble: 1 × 2
## Country Net_Migration_Rate
## <chr> <dbl>
## 1 Syria 27.1
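The same lookup can be written more compactly with slice_min() and slice_max() from dplyr. This is a sketch on hypothetical rows (the tibble toy and its values are made up), not a replacement for the code above.

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- tibble(
  Country            = c("A", "B", "C", "D"),
  Net_Migration_Rate = c(-3.2, 5.1, NA, 0.4)
)

# Drop missing values first, then pick the rows with the
# smallest and the largest net migration rate
complete <- toy %>% filter(!is.na(Net_Migration_Rate))
lowest   <- complete %>% slice_min(Net_Migration_Rate, n = 1)
highest  <- complete %>% slice_max(Net_Migration_Rate, n = 1)

print(lowest)
print(highest)
```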
The graph in d. clearly does not convey the whole picture. It would be interesting also to look at the subcontinents, as it is likely that a lot of migration flows happen within the continent.
Answer
# Filter out non-finite values
filtered_data <- data_tibble %>%
filter(is.finite(Net_Migration_Rate))
# Create parallel boxplots by subcontinent, faceted by continent
ggplot(filtered_data, aes(x = Region, y = Net_Migration_Rate)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
labs(x = "Subcontinent", y = "Net Migration Rate") +
ggtitle("Distribution of Net Migration Rate by Subcontinent") +
coord_cartesian(ylim = c(-30, 30)) +
facet_grid(. ~ Continent, scales = "free_x") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
strip.text.x = element_text(size = 12, face = "bold"))
When splitting continents into subcontinents, the detailed breakdown reveals significant regional variations within each continent that are not visible in the broader view. Subcontinent-specific migration patterns and outliers become apparent, such as the distinct differences between Central and Western Asia or among the subregions of Africa.
The plot in task e. shows the distribution of the net migration rate for each subcontinent. Here you will work on visualizing only one summary statistic, namely the median. For each subcontinent, calculate the median net migration rate. Then create a plot which contains the sub-regions on the y-axis and the median net migration rate on the x-axis.
Answer
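One possible approach, shown as a minimal sketch on hypothetical toy data (the tibble toy stands in for data_tibble, and its values are made up): compute the median per subcontinent with group_by() and summarise(), then draw horizontal bars. Mapping a discrete variable to y in geom_col() assumes ggplot2 version 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for data_tibble
toy <- tibble(
  Region             = c("North", "North", "North", "South", "South"),
  Net_Migration_Rate = c(1.0, 3.0, NA, -2.0, -4.0)
)

# Median per subcontinent, ignoring missing values
medians <- toy %>%
  group_by(Region) %>%
  summarise(Median_Rate = median(Net_Migration_Rate, na.rm = TRUE),
            .groups = "drop")

# Sub-regions on the y-axis, ordered by their median
p <- ggplot(medians, aes(x = Median_Rate, y = reorder(Region, Median_Rate))) +
  geom_col() +
  labs(x = "Median net migration rate", y = "Subcontinent")

# print(p) draws the plot
```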
For each subcontinent, calculate the median youth unemployment rate. Then create a plot which contains the sub-regions on the y-axis and the median unemployment rate on the x-axis.
Answer
The value displayed in the barplot in g. is the result of an aggregation, so it can be useful to also plot error bars to get a general idea of how precise the median unemployment rate is. The error bars can reflect the standard deviation or the interquartile range of the variable in each subcontinent.
Repeat the plot in h. but include also error bars which reflect the 25% and 75% quantiles. You can use geom_errorbar in ggplot2.
Answer
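One way to add quantile-based error bars, sketched on hypothetical toy data (the tibble toy and its values are made up): the 25% and 75% quantiles are computed alongside the median and passed to geom_errorbar() as ymin and ymax; coord_flip() then puts the sub-regions on the y-axis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- tibble(
  Region = rep(c("North", "South"), each = 5),
  Youth_Unemployment_Rate = c(10, 12, 14, 16, 18, 20, 25, 30, 35, 40)
)

# Median and quartiles per subcontinent
summary_tbl <- toy %>%
  group_by(Region) %>%
  summarise(
    med = median(Youth_Unemployment_Rate, na.rm = TRUE),
    q25 = unname(quantile(Youth_Unemployment_Rate, 0.25, na.rm = TRUE)),
    q75 = unname(quantile(Youth_Unemployment_Rate, 0.75, na.rm = TRUE)),
    .groups = "drop"
  )

# Bars show the median, error bars span the interquartile range
p <- ggplot(summary_tbl, aes(x = Region, y = med)) +
  geom_col() +
  geom_errorbar(aes(ymin = q25, ymax = q75), width = 0.2) +
  coord_flip() +
  labs(x = "Subcontinent", y = "Median youth unemployment rate")

# print(p) draws the plot
```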
Using ggplot2, create a plot showing the relationship between median age and net migration rate.
Comment on the plot. Do you see any relationship between the two variables? Do you see any difference among the income levels?
Answer
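A possible starting point, as a sketch on hypothetical toy data (the data frame toy and its values are made up): a scatterplot coloured by income status, complemented by a rank correlation as a rough numeric summary of the relationship.

```r
library(ggplot2)

income_levels <- c("Low", "Lower-middle", "Upper-middle", "High")

# Hypothetical toy data, for illustration only
toy <- data.frame(
  Median_Age          = c(18, 20, 25, 30, 38, 42),
  Net_Migration_Rate  = c(-2.0, -1.0, 0.5, 0.0, 2.0, 3.0),
  Classification_2020 = factor(
    c("Low", "Low", "Lower-middle", "Upper-middle", "High", "High"),
    levels = income_levels, ordered = TRUE
  )
)

# Scatterplot with one colour per income group
p <- ggplot(toy, aes(x = Median_Age, y = Net_Migration_Rate,
                     colour = Classification_2020)) +
  geom_point() +
  labs(x = "Median age", y = "Net migration rate", colour = "Income status")

# Spearman rank correlation, using complete observations only
rho <- cor(toy$Median_Age, toy$Net_Migration_Rate,
           method = "spearman", use = "complete.obs")
print(rho)
```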
Create a plot as in Task f. but for youth unemployment and net migration rate. Comment briefly.
Answer
Go online and find a data set which contains the 2020 population for the countries of the world together with ISO codes.
pop_data <- read.csv("../data/country_population_data.csv")
pop_data <- pop_data[c("Country.Code", "X2020")]
colnames(pop_data) <- c("ISO", "Population_2020")
full_data <- left_join(data_tibble, pop_data)
## Joining with `by = join_by(ISO)`
head(full_data)
## # A tibble: 6 × 9
## Net_Migration_Rate Median_Age Youth_Unemployment_R…¹ ISO Classification_2020
## <dbl> <dbl> <dbl> <chr> <ord>
## 1 27.1 23.5 35.8 SYR Low
## 2 15.5 37.2 NA VGB High
## 3 13.3 39.5 14.2 LUX High
## 4 13 40.5 13.8 CYM High
## 5 11.8 35.6 9.1 SGP High
## 6 10.6 32.9 5.3 BHR High
## # ℹ abbreviated name: ¹Youth_Unemployment_Rate
## # ℹ 4 more variables: Country <chr>, Region <fct>, Continent <fct>,
## # Population_2020 <dbl>
For the most part the join worked well. I chose left_join so that countries that appear in the population data but not in the original data are ignored (otherwise they would produce rows in which most values are NA). Only two problems occurred during the merge. For Kosovo, the original data set uses XKS as the ISO code while the population data uses XKX; this was easily corrected by changing the country code in the population data. For Taiwan, no population data is available on the World Bank site, since Taiwan is reported there as part of China, so no figure exists for Taiwan alone.
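Mismatched codes of this kind can be found systematically with anti_join(). The following is a minimal sketch on hypothetical miniature tables (main_tbl, pop_tbl and the population figures are made up for illustration): it lists the ISO codes without a partner, recodes XKX to XKS in the population data, and joins again.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
main_tbl <- tibble(ISO     = c("AUT", "XKS", "TWN"),
                   Country = c("Austria", "Kosovo", "Taiwan"))
pop_tbl  <- tibble(ISO             = c("AUT", "XKX", "ABW"),
                   Population_2020 = c(8.9e6, 1.8e6, 1.1e5))

# Codes in the main data without a partner in the population data
unmatched_before <- anti_join(main_tbl, pop_tbl, by = "ISO")

# Harmonise the Kosovo code in the population data, then join
pop_fixed <- pop_tbl %>%
  mutate(ISO = if_else(ISO == "XKX", "XKS", ISO))

joined          <- left_join(main_tbl, pop_fixed, by = "ISO")
unmatched_after <- anti_join(main_tbl, pop_fixed, by = "ISO")

print(unmatched_before)
print(unmatched_after)
```

Passing by = "ISO" explicitly also avoids the message about the automatically chosen join column.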
Make a scatterplot of median age and net migration rate for the countries of Europe. Scale the size of the points according to each country’s population.
scatterplot_data <- full_data %>%
  filter(Continent == "Europe")
# alpha is a constant, so it is set in geom_point() rather than mapped in aes()
ggplot(scatterplot_data,
       aes(x = Median_Age, y = Net_Migration_Rate,
           size = Population_2020, color = Country)) +
  geom_point(alpha = 0.7) +
  theme(legend.position = "none") +
  ggtitle("Median age vs. Migration Rate in Europe")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The median age of most European countries is above 40 years. There is no visible relationship between net migration rate and median age: most data points are concentrated in a single cluster in the middle of the plot.
On the merged data set from Task k., using function ggplotly from package plotly re-create the scatterplot in Task l., but this time for all countries. Color the points according to their continent.
When hovering over the points, the name of the country and the values for median age, net migration rate, and population should be shown. (Hint: use the aesthetic text = Country. In ggplotly use the argument tooltip = c("text", "x", "y", "size")).
# alpha is a constant, so it is set in geom_point() rather than mapped in aes()
p <- ggplot(full_data,
            aes(x = Median_Age, y = Net_Migration_Rate,
                size = Population_2020, color = Continent, text = Country)) +
  geom_point(alpha = 0.7) +
  theme(legend.position = "none") +
  ggtitle("Median age vs. Migration Rate, Worldwide")
pltly <- ggplotly(p, tooltip = c("text", "x", "y", "size"))
pltly
In a parallel coordinate plot, each observation is depicted as a line traversing a series of parallel axes, each corresponding to one variable or dimension. Such plots are often used to identify clusters in the data.
One can create such a plot using the GGally R package. You should create such a plot where you look at the three main variables in the data set: median age, youth unemployment rate and net migration rate. Color the lines based on the income status. Briefly comment.
library(GGally)
ggparcoord(data=full_data, columns=c(1:3), groupColumn="Classification_2020", title="Income-Class to Data Comparison, Worldwide")
The net migration rate and the median age both tend to rise with income class, while the youth unemployment rate is spread more evenly across the classes (although most high-income countries still have a fairly low youth unemployment rate).
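When reading the vertical axis, it helps to keep in mind that ggparcoord() was called without a scale argument, so its default scale = "std" applies and each variable is standardised to mean zero and unit standard deviation. The sketch below reproduces that transformation with base R on hypothetical values (the data frame toy is made up), assuming this default.

```r
# Hypothetical values, for illustration only
toy <- data.frame(
  Median_Age         = c(20, 30, 40),
  Net_Migration_Rate = c(-5, 0, 5)
)

# Subtract the column mean and divide by the column standard deviation,
# so that all variables share a comparable axis
standardised <- as.data.frame(scale(toy))

print(standardised)
```

Because of this rescaling, the heights of the lines are comparable across axes but are no longer in the original units.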
Using the package rworldmap, create a world map of the median age per country. Use the vignette to find how to do this in R.
library(rworldmap)
cdatamap <- joinCountryData2Map(full_data, joinCode = "ISO3", nameJoinColumn = "ISO")
## 215 codes from your data successfully matched countries in the map
## 3 codes from your data failed to match with a country code in the map
## 29 codes from the map weren't represented in your data
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(cdatamap, nameColumnToPlot = "Median_Age")